Add evals/: schema-rejection and tool-retrieval regression coverage#62

Merged
rajeeja merged 2 commits into main from rajeeja/evals
Jun 10, 2026
@rajeeja rajeeja commented Jun 10, 2026

Collaborator

Summary

Two cheap, runnable evals under evals/ that turn behavior we care about into numbers we can re-measure on every PR.

  • evals/schema_rejection/ — 21 calls (19 deliberately malformed, 2 baselines). Classifies each outcome by layer (schema / IO / runtime / silent) and reports caught_rate. Currently 94.7% with 1 silent pass: plot_dataset(plot_type='variable') accepts a call with no variable_name and returns a plot anyway. That's a real bug surfaced by the eval; tracked separately, not fixed in this PR.
  • evals/tool_retrieval/ — BM25 over the full ~54-function tool surface against 30 labeled prompts. Reports top-1 / top-3 / top-5 selection accuracy and the mean rank of the correct tool. Currently 77% top-1, 87% top-3, 93% top-5.
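
For readers new to the caught_rate number, here is a minimal sketch of how such a rate could be computed. It is not the runner's actual code. It assumes the rate is taken over the 19 malformed calls only (18 / 19 matches the reported 94.7%), and the per-layer split below is hypothetical, since the summary gives only the total and the single silent pass.

```python
from collections import Counter

# Layers that count as "caught"; a "silent" outcome means the bad call went through.
CAUGHT_LAYERS = {"schema", "io", "runtime"}


def caught_rate(outcomes):
    """Fraction of malformed calls rejected at some layer.

    `outcomes` holds one layer label per malformed call (baselines excluded).
    """
    counts = Counter(outcomes)
    caught = sum(counts[layer] for layer in CAUGHT_LAYERS)
    return caught / len(outcomes)


# Hypothetical split of the 18 caught calls; only "19 malformed, 1 silent"
# comes from the PR description.
outcomes = ["schema"] * 12 + ["io"] * 4 + ["runtime"] * 2 + ["silent"]

print(f"caught_rate = {caught_rate(outcomes):.1%}")  # caught_rate = 94.7%
```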

Both runners complete in under 30 seconds with no external dependencies. Eval result JSON files are gitignored; the runners themselves are the source of truth.

evals/README.md explains what an eval is for a non-AI engineer and lists when to add one vs. when to write a unit test.

Test plan

  • uv run pre-commit run --all-files — passes.
  • uv run pytest tests/ --ignore=tests/test_remote_agent.py — 295 passed.
  • uv run python -m evals.schema_rejection.run — completes; 1 known silent-pass bug reported.
  • uv run python -m evals.tool_retrieval.run — completes; 77 / 87 / 93 numbers reproduce.

rajeeja added 2 commits June 10, 2026 14:58
Two cheap, runnable evals that turn behavior we care about into numbers we
can re-measure on every PR:

- evals/schema_rejection/ — 21 calls (19 deliberately malformed, 2 baselines)
  classify each outcome by layer (schema / IO / runtime / silent). Headline
  number is caught_rate. Currently 94.7% with 1 silent pass (plot_dataset
  with plot_type='variable' but no variable_name still returns a plot).

- evals/tool_retrieval/ — BM25 over the full ~54-function tool surface
  against 30 labeled prompts. Reports top-1 / top-3 / top-5 selection
  accuracy and mean rank of the correct tool. Currently 77% / 87% / 93%.

Both runners run in under 30 seconds with no external dependencies. Result
JSON files are gitignored; the runners are the source of truth.

evals/README.md explains what an eval is for a non-AI engineer and lists
when to add new ones vs. when to write a unit test instead.
…ually type

Targeted the 7 tools that ranked worst in the BM25 retrieval eval — rewrote
each first line to include the words a user would naturally use ("wireframe",
"colored map", "ensemble", "time average", "is the endpoint healthy", "start
a new session", "list variables") rather than internal jargon.
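
Why the first docstring line matters can be illustrated with a small, dependency-free BM25 ranker. This is a sketch under assumptions, not the eval's implementation: the scoring is one common Okapi BM25 variant with default-style parameters (k1=1.5, b=0.75), tokenization is plain whitespace splitting, and the docstring texts in TOOLS are made up for illustration (only the tool names come from this PR).

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document names ordered by BM25 score, best match first."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical first-line docstrings; the real ones live in the repository.
TOOLS = {
    "plot_mesh": "draw the mesh wireframe for a grid",
    "plot_variable": "draw a colored map of a variable on the mesh",
    "calculate_temporal_mean": "compute the time average of a variable",
}

print(bm25_rank("show me a wireframe", TOOLS)[0])  # plot_mesh
```

Because BM25 is purely lexical, a prompt can only pull up a tool whose text shares words with it, which is why putting user vocabulary in the leading sentence moves the ranking.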

evals/tool_retrieval results, same 30-prompt set:

  before:  top-1 77%, top-3 87%, top-5 93%, mean rank 2.33, worst rank 19
  after:   top-1 93%, top-3 100%, top-5 100%, mean rank 1.07, worst rank 2

The two remaining rank-2 cases are genuinely ambiguous (plot_mesh vs.
plot_mesh_geo; inspect_variable vs. get_capabilities) and the right ones
land in the top-3 shortlist — which is what discover_tools will return.
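
As a sanity check on the "after" row, the reported figures are consistent with 28 prompts ranked first and 2 ranked second, which follows from the worst rank of 2 and the two rank-2 cases described above. The helper names below are illustrative and not taken from the runner. The "before" row cannot be reconstructed this way because its per-prompt ranks are not listed.

```python
def top_k_accuracy(ranks, k):
    """Share of prompts whose correct tool appears at rank k or better."""
    return sum(rank <= k for rank in ranks) / len(ranks)


def mean_rank(ranks):
    """Average 1-based rank of the correct tool across prompts."""
    return sum(ranks) / len(ranks)


# 30 prompts: 28 at rank 1, 2 at rank 2 (inferred from the reported numbers).
after_ranks = [1] * 28 + [2] * 2

print(f"top-1 {top_k_accuracy(after_ranks, 1):.0%}")  # top-1 93%
print(f"top-3 {top_k_accuracy(after_ranks, 3):.0%}")  # top-3 100%
print(f"mean rank {mean_rank(after_ranks):.2f}")      # mean rank 1.07
```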

Tools touched: create_session, calculate_temporal_mean, calculate_ensemble_mean,
diagnose_endpoint, inspect_variable, plot_mesh, plot_variable, plot_mesh_geo,
get_capabilities. Behavior unchanged; only the leading docstring sentence
moves.

Pre-commit (including mypy) and the full test suite (295 tests) pass.
@rajeeja rajeeja merged commit 1d9d947 into main Jun 10, 2026
7 checks passed
@rajeeja rajeeja deleted the rajeeja/evals branch June 10, 2026 21:47